Read 1304346 rows and 19 (of 19) columns from 0.240 GB file in 00:00:07
## cmte_id cand_id cand_nm
## Length:1304346 Length:1304346 Length:1304346
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## contbr_nm contbr_city contbr_st
## Length:1304346 Length:1304346 Length:1304346
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## contbr_zip contbr_employer contbr_occupation
## Length:1304346 Length:1304346 Length:1304346
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## contb_receipt_amt contb_receipt_dt receipt_desc
## Min. :-10500.0 Length:1304346 Length:1304346
## 1st Qu.: 15.0 Class :character Class :character
## Median : 27.0 Mode :character Mode :character
## Mean : 116.2
## 3rd Qu.: 88.0
## Max. : 10800.0
## memo_cd memo_text form_tp
## Length:1304346 Length:1304346 Length:1304346
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## file_num tran_id election_tp
## Length:1304346 Length:1304346 Length:1304346
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## cmte_id cand_id cand_nm
## Length:1287336 Length:1287336 Length:1287336
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## contbr_nm contbr_city contbr_st
## Length:1287336 Length:1287336 Length:1287336
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## contbr_zip contbr_employer contbr_occupation
## Length:1287336 Length:1287336 Length:1287336
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## contb_receipt_amt contb_receipt_dt receipt_desc
## Min. : 0.01 Length:1287336 Length:1287336
## 1st Qu.: 15.00 Class :character Class :character
## Median : 27.00 Mode :character Mode :character
## Mean : 120.46
## 3rd Qu.: 97.62
## Max. :2700.00
## memo_cd memo_text form_tp
## Length:1287336 Length:1287336 Length:1287336
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## file_num tran_id election_tp
## Length:1287336 Length:1287336 Length:1287336
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
This is an exploration of 2016 US presidential campaign donations in the state
of California. The dataset contains financial contribution transactions
from 2013 to 2016.
Inspecting the data, I found that some contribution amounts are
negative, which I believe are refunds. Some other contribution amounts
exceed $2700 (the donation limit) and will be
refunded. Therefore, I cleaned the data by omitting these invalid records.
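The cleaning rule above can be sketched as follows. This is a minimal illustration on a made-up data frame, not the original code: the name `toy` and its values are assumptions, and the bounds (strictly positive, at most $2700) are inferred from the summary of the cleaned data.

```r
# Toy stand-in for the contribution data; all values are invented.
toy <- data.frame(
  contbr_nm = c("A", "B", "C", "D", "E"),
  contb_receipt_amt = c(-50, 25, 2700, 5400, 0.01),
  stringsAsFactors = FALSE
)

# Keep contributions that are positive and within the $2700 limit.
cleaned <- subset(toy, contb_receipt_amt > 0 & contb_receipt_amt <= 2700)

# The minimum and maximum should now fall inside the valid range.
summary(cleaned$contb_receipt_amt)
```

Filtering on the amount alone drops both the refunds and the over-limit records in one step.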
Analyzing the distribution of the contribution amount (contb_receipt_amt
in the dataset) shows that the median one-time donation is $27,
while the most frequent donation amount is $25, followed by $50 and $100.
Most contribution records are no more than $200.
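A short sketch of how such summary figures can be computed. The vector `amounts` is invented for illustration, so the results below do not reproduce the figures in the text.

```r
# Invented contribution amounts, for illustration only.
amounts <- c(25, 25, 25, 50, 50, 100, 27, 10, 250, 15)

# Median of the one-time contributions.
median_amt <- median(amounts)

# Count how often each amount occurs, most frequent first.
amount_counts <- sort(table(amounts), decreasing = TRUE)
most_frequent <- as.numeric(names(amount_counts)[1])

# Share of records at or below $200.
share_small <- mean(amounts <= 200)
```

Sorting the frequency table gives the most common amounts directly, which is easier to read off than a histogram when the values cluster on round numbers.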
To clean the data related to the contributor's city, I kept only the records
whose contbr_city value in the original dataset begins with a letter and
contains no digits or other characters such as “#” or “.”. From the cleaned
data, I obtained the top 10 cities with the most contribution records.
The bar chart illustrates that 97766 contributions were made in Los Angeles,
followed by San Francisco (86453) and San Diego (43720).
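The city filter described above could look roughly like this. The regular expression is an assumption about what "begins with a letter and contains no digits or other characters" means in practice, and the city values are invented.

```r
# Invented city values, including some that should be rejected.
cities <- c("LOS ANGELES", "SAN FRANCISCO", "LOS ANGELES", "90210",
            "#12 MAIN ST", "SAN DIEGO", "ST. HELENA", "LOS ANGELES",
            "SAN FRANCISCO")

# Assumed pattern: starts with a letter, then only letters and spaces.
is_valid <- grepl("^[A-Za-z][A-Za-z ]*$", cities)
valid_cities <- cities[is_valid]

# Count records per city and keep the most frequent ones.
city_counts <- sort(table(valid_cities), decreasing = TRUE)
top_cities <- head(city_counts, 10)
```

Note that a strict pattern like this also drops legitimate names containing punctuation, such as "ST. HELENA", so the counts are a lower bound for those cities.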
Analyzing the data related to candidates (variable: cand_nm), there are 25
candidates in the presidential election; however, the contributions they
received differ significantly. The bar chart illustrates the top 5 candidates
with the most contribution records in the dataset. Hillary Clinton ranks
first, followed by Bernard Sanders and Donald Trump.
The candidate’s party affiliation is an important feature for the election
results, but it is not included in the original dataset. Therefore, I added an
extra variable “cand_party” to the dataset representing the candidate’s party
affiliation. I found that there are 5 candidates from the Democratic party,
17 candidates from the Republican party and 3 candidates from other parties.
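One way to add such a party variable is a named lookup vector, sketched below. The lookup table is hypothetical and covers only a few candidates; the "LAST, FIRST" spelling of the names is an assumption about how cand_nm is stored.

```r
# Hypothetical lookup; names follow an assumed "LAST, FIRST" format.
party_lookup <- c(
  "Clinton, Hillary Rodham" = "Democratic",
  "Sanders, Bernard"        = "Democratic",
  "Trump, Donald J."        = "Republican",
  "Rubio, Marco"            = "Republican",
  "Stein, Jill"             = "Others"
)

# Invented contribution records, one row per contribution.
toy <- data.frame(
  cand_nm = c("Sanders, Bernard", "Trump, Donald J.", "Stein, Jill",
              "Clinton, Hillary Rodham", "Sanders, Bernard"),
  stringsAsFactors = FALSE
)

# Map each record's candidate to a party through the named vector.
toy$cand_party <- unname(party_lookup[toy$cand_nm])

# Number of distinct candidates per party.
candidates <- unique(toy[, c("cand_nm", "cand_party")])
candidates_per_party <- table(candidates$cand_party)
```

Counting over distinct candidates rather than rows matters here, because each candidate appears once per contribution record.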
I found the top five candidates with the most contribution records and plotted
a bar chart. It shows that Hillary Clinton ranks first with 655622
records, followed by Bernard Sanders with 387451 records and Donald Trump
with 80714 records.
Men and women may have different opinions when choosing a president, so I
added a variable “gender” to the dataset by applying the function gender(),
which predicts the gender of a donor from their first name. The data shows
that in California, women made nearly 16% more contribution records than
men.
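The first-name step can be sketched as below. The "LAST, FIRST MIDDLE" layout of contbr_nm is an assumption, and the small `name_to_gender` table is a hypothetical stand-in for a name-based prediction so that the example runs without extra packages or data; it is not how the gender package works internally.

```r
# Invented contributor names in an assumed "LAST, FIRST MIDDLE" layout.
toy <- data.frame(
  contbr_nm = c("SMITH, MARY A", "DOE, JOHN", "LEE, LINDA", "KIM, MARY"),
  stringsAsFactors = FALSE
)

# Take the text after the comma, then keep its first word.
after_comma <- trimws(sub("^[^,]*,", "", toy$contbr_nm))
toy$contbr_first_nm <- sub(" .*$", "", after_comma)

# Hypothetical stand-in for a name-based gender prediction.
name_to_gender <- c(MARY = "female", LINDA = "female", JOHN = "male")
toy$gender <- unname(name_to_gender[toy$contbr_first_nm])

# Contribution records per predicted gender.
records_by_gender <- table(toy$gender)
```

Names missing from the lookup come back as NA, which mirrors the fact that any name-based prediction leaves some donors unclassified.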
Focusing on the data related to occupation, the bar chart shows
that retired people made the most contributions to the election, followed by
people who were not employed and attorneys. Specifically, the number of
contribution records from retired people is more than double that from people
who were not employed.
To analyze when people donated, I added a new variable
“contbr_year” to see how many contribution records fall in each year
from 2013 to 2016. The line chart illustrates that people tended to donate
as the election approached. Only one record took place in 2013,
while the number increased to 133562 in 2015 and then grew dramatically
in 2016 to 1104701, more than 8 times the number in 2015.
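A minimal sketch of deriving the year and counting records per year. The "DD-MON-YY" date layout is an assumption about contb_receipt_dt, and base R string handling is used here instead of a date package so the example has no dependencies.

```r
# Invented dates in an assumed "DD-MON-YY" layout.
toy <- data.frame(
  contb_receipt_dt = c("15-NOV-15", "02-FEB-16", "28-OCT-16", "30-DEC-13"),
  stringsAsFactors = FALSE
)

# The two digits after the last hyphen are taken as the year.
toy$contbr_year <- 2000L + as.integer(sub("^.*-", "", toy$contb_receipt_dt))

# Number of contribution records per year.
records_per_year <- table(toy$contbr_year)

# Growth factor quoted in the text: 2016 records over 2015 records.
growth <- 1104701 / 133562  # about 8.3
```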
First, the dataset of 2016 US presidential campaign donations in
the state of California contains 1304346 contributions and 18 variables.
The variables that I think might be useful for exploring the data are as
follows:
cand_nm: Candidate’s name
contbr_nm: Contributor’s name
contbr_city: The city the contributor lives in
contbr_occupation: Contributor’s occupation
contb_receipt_amt: The amount of contribution
contb_receipt_dt: Contribution date
Other observations of this dataset:
One person may donate multiple times, which is why in this section I only
explored the contribution records related to the variables mentioned above.
The median one-time contribution is $27; the minimum donation is
$0.01, while the maximum donation is $2700 (excluding records above the
$2700 limit).
The Democratic party has only 5 candidates, compared to 17 candidates
from the Republican party.
People in Los Angeles made the most contributions in California.
Women made 16% more contribution records than men.
Retired people made the largest number of donations.
People tended to donate money close to the election.
The main features in the dataset are candidate, candidate party affiliation
and contribution amount. I would like to explore the data further, and
I also think that a model predicting which party a donor will
donate to could be built with variables in the dataset and some additional
variables I can obtain.
In my opinion, a donor’s occupation and gender, and the contribution time and
location, are likely to influence the contribution amount and the party that
the donor supports. I speculate that occupation is an important feature that
can to some degree determine both the amount someone contributes and the
party they contribute to; location is another key feature that probably
determines the party they contribute to.
I created four new variables:
cand_party: candidate’s party affiliation
contbr_first_nm: contributor’s first name
gender: contributor’s gender
contbr_year: the contribution year
I found that some contribution amounts are negative, which
I believe are refunds. Some other contribution amounts exceed
$2700 (the donation limit) and will be refunded.
Therefore, I omitted these invalid records. When examining the relationship
between occupation and contribution, I found that some records did not offer
valid occupation information, such as “INFORMATION REQUESTED”,
“INFORMATION REQUESTED PER BEST EFFORTS”, “N/A” or an empty value,
so I removed these records when analyzing the data.
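The occupation filter can be sketched as below on invented data. Treating missing values (NA) the same way as empty strings is an assumption made for this example.

```r
# Invented records, including the placeholder occupations named above.
toy <- data.frame(
  contbr_occupation = c("RETIRED", "INFORMATION REQUESTED", "ATTORNEY",
                        "N/A", "", "INFORMATION REQUESTED PER BEST EFFORTS",
                        NA, "NOT EMPLOYED"),
  contb_receipt_amt = c(100, 50, 250, 25, 10, 75, 20, 15),
  stringsAsFactors = FALSE
)

# Values that carry no usable occupation information.
invalid <- c("INFORMATION REQUESTED",
             "INFORMATION REQUESTED PER BEST EFFORTS", "N/A", "")

# Drop placeholder and missing occupations.
keep <- !is.na(toy$contbr_occupation) & !(toy$contbr_occupation %in% invalid)
occ_data <- toy[keep, ]
```

Keeping the filtered rows in a separate data frame means the records without a usable occupation remain available for the analyses that do not depend on it.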
Since the candidate’s party affiliation is a key feature to analyze, I explored
the relationship between party affiliation and other
valuable features. First, there are 17, 5 and 3 candidates from the
Republican party, the Democratic party and other parties respectively.
Republican candidates make up 68% of all candidates;
however, the data shows that Democratic candidates received more donations
than other candidates even though Republicans have the advantage in numbers.
The bar chart and pie chart illustrate that the Democratic party received
$110913477.09, accounting for 74.77% of the total donations in California.
Since one person may donate multiple times,
I also considered the number of donors by party. The data shows that the
Democratic party had 141935 donors, around 75% more than the Republican party.
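A sketch of the two aggregations described above: total amount and share per party, and the number of distinct donors per party. The data is invented, and identifying a donor by name alone is a simplifying assumption.

```r
# Invented records; donors A and D each donate more than once.
toy <- data.frame(
  contbr_nm = c("A", "A", "B", "C", "D", "D", "E"),
  cand_party = c("Democratic", "Democratic", "Democratic", "Republican",
                 "Republican", "Republican", "Others"),
  contb_receipt_amt = c(100, 200, 300, 100, 50, 50, 200),
  stringsAsFactors = FALSE
)

# Total donation per party and each party's share of the overall total.
party_total <- tapply(toy$contb_receipt_amt, toy$cand_party, sum)
party_share <- round(100 * party_total / sum(party_total), 2)

# Distinct donors per party (a donor is identified by name only here).
donors <- unique(toy[, c("contbr_nm", "cand_party")])
donors_per_party <- table(donors$cand_party)
```

Deduplicating on the donor and party pair before counting is what separates "number of donors" from "number of contribution records".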
Exploring the relationship between candidate and contribution amount,
as well as the number of donors, I found that candidates who received a larger
amount of donations may have had fewer donors. For example, the two bar charts
show that Bernard Sanders ranks second in the amount of
donations received; however, he had fewer donors than Donald Trump,
which means some of his donors may be richer than Donald Trump’s.
Furthermore, Marco Rubio ranks fifth in number of donors, whereas the amount
of donations he received is not within the top 5. Overall, Hillary Clinton
received far more donations than other candidates, with $90742475.97,
about 61% of the total donations in California; she also had
the most donors among all the candidates.
In the previous section, I mentioned that women had more contribution records
than men; however, I found that men in fact donated more money than women,
with total contributions of 78363032.26 dollars and 69977401.22 dollars
respectively. In addition, the average donation by men and women is around
697 and 651 dollars respectively.
Donors’ occupation is also a key feature that influences the contribution
amount. I explored the top 10 occupations that donated the most in the
presidential election. Retired people contributed the most,
with 24529708.58 dollars, accounting for nearly 43% of the total, followed by
attorneys ($7996430.69) and people who were not employed ($6192383.72).
However, the average donation of these ten occupations gives a different
picture. Presidents and CEOs had high average donation amounts of around
$1509 and $1501 respectively, ranking first and second, while homemakers
ranked third with $1119. Not surprisingly, although retired people
and people who were not employed donated a large total amount,
their average donation amounts are the lowest among these ten occupations.
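The contrast between total and average donation can be sketched as follows. The data is invented and simply reproduces the pattern of a group with many small donations next to a group with few large ones.

```r
# Invented records: many small retired donations, few large CEO donations.
toy <- data.frame(
  contbr_occupation = c(rep("RETIRED", 30), "CEO", "CEO",
                        "ATTORNEY", "ATTORNEY"),
  contb_receipt_amt = c(rep(100, 30), 1500, 1000, 300, 500),
  stringsAsFactors = FALSE
)

# Total and average donation per occupation.
occ_total <- tapply(toy$contb_receipt_amt, toy$contbr_occupation, sum)
occ_mean  <- tapply(toy$contb_receipt_amt, toy$contbr_occupation, mean)

# Rank occupations by each measure, largest first.
by_total <- sort(occ_total, decreasing = TRUE)
by_mean  <- sort(occ_mean, decreasing = TRUE)
```

In this toy data the retired group leads by total but comes last by average, which is the same reversal the two charts show.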
The trend of donation amount by year is similar to the trend of the number of
donation records. Most donations took place in 2015 and 2016.
In this section, I mainly investigated the relationships between the important
features mentioned in the previous section and the main feature,
the contribution amount. My findings are as follows:
$110913477.09 was donated to the Democratic party, accounting for
74.77% of the total donations in California, which is also 3 times
the figure for the Republican party.
The majority of contributions were received by a few candidates.
Hillary Clinton received far more donations than other candidates, with
$90742475.97, about 61% of the total donations in California;
she also had the most donors among all the candidates.
In California, men contributed more than women.
Retired people donated the most (nearly 43% of the total) among all
occupations even though their average donation is low.
Most of the donations took place in 2015 and 2016.
There are 17 candidates from the Republican party and only 5 candidates from
the Democratic party; however, in California, nearly three quarters of the
donations went to the Democratic party. In addition, candidates who received a
larger amount of donations may have had fewer donors. For example,
Bernard Sanders ranks second in the amount of donations
received; however, he had fewer donors than Donald Trump,
which means some of his donors may be richer than Donald Trump’s.
Furthermore, Marco Rubio ranks fifth in number of donors, whereas the amount
of donations he received is not within the top 5.
The Democratic party received more donations and had more supporters than the
Republican party and other parties in California. Retired people contributed
the largest total amount to the election, and this group followed this
campaign most closely.
Looking at the exact amount donated to each party
(not considering the “Others” party) by the top 10 occupations mentioned
before, I found that people from every occupation donated more to the
Democratic party than to the Republican party. Furthermore, for the Democratic
party, the top three donating occupations are retired people, attorneys and
not-employed people; however, for the Republican party, retired people,
homemakers and CEOs made the largest contributions. It is worth mentioning
that people who were not employed donated 6166751.54 dollars to the Democratic
party but only 25632.18 dollars to the Republican party.
Interestingly, the same pattern holds for Hillary Clinton
(a competitive candidate from the Democratic party) and Donald Trump
(a competitive candidate from the Republican party).
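The occupation-by-party comparison amounts to a two-way table of summed amounts, sketched below on invented data. Dropping the "Others" party before tabulating follows the description above.

```r
# Invented records across three occupations and three party labels.
toy <- data.frame(
  contbr_occupation = c("RETIRED", "RETIRED", "NOT EMPLOYED", "NOT EMPLOYED",
                        "CEO", "CEO", "RETIRED"),
  cand_party = c("Democratic", "Republican", "Democratic", "Republican",
                 "Democratic", "Republican", "Others"),
  contb_receipt_amt = c(500, 300, 400, 5, 200, 150, 100),
  stringsAsFactors = FALSE
)

# Leave out the "Others" party, as in the comparison above.
major <- subset(toy, cand_party != "Others")

# Sum of donations for every occupation and party combination.
amount_by_occ_party <- xtabs(contb_receipt_amt ~ contbr_occupation + cand_party,
                             data = major)

# TRUE when every occupation gave more to the Democratic party.
all_more_dem <- all(amount_by_occ_party[, "Democratic"] >
                    amount_by_occ_party[, "Republican"])
```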
Although contribution records in the dataset start in 2013, they are very
few compared to 2015 and 2016. Therefore, I focused only on these two years
and drew a bar chart showing the contribution amount of the top 5 candidates
by year, as well as a boxplot illustrating the contribution distribution of
the top 5 candidates by year. I obtained the following information:
All the candidates except Marco Rubio raised more money in 2016 than in 2015.
Hillary Clinton received more than 3 times as much in 2016 as in 2015,
which is also true of Donald Trump.
There are many outliers in the one-time contribution amounts.
It seems that people who donated to Marco Rubio in 2015 are much richer
than people who donated to other candidates.
People who made contributions came from different income-level groups.
People tended to contribute less money per donation in 2016 than in 2015.
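The year-over-year comparison behind these points can be sketched as below. The data is invented, and the short candidate labels are placeholders rather than the values stored in cand_nm.

```r
# Invented records for three candidates over two years.
toy <- data.frame(
  cand_nm = c("Clinton", "Clinton", "Clinton", "Rubio", "Rubio",
              "Trump", "Trump"),
  contbr_year = c(2015, 2016, 2016, 2015, 2016, 2015, 2016),
  contb_receipt_amt = c(100, 200, 150, 300, 50, 20, 90),
  stringsAsFactors = FALSE
)

# Restrict to the two years with enough records.
recent <- subset(toy, contbr_year %in% c(2015, 2016))

# Total amount per candidate and year.
amount_by_year <- xtabs(contb_receipt_amt ~ cand_nm + contbr_year,
                        data = recent)

# Ratio above 1 means the candidate raised more in 2016 than in 2015.
ratio <- amount_by_year[, "2016"] / amount_by_year[, "2015"]

# Median one-time contribution per year, the centre line of a boxplot.
median_by_year <- tapply(recent$contb_receipt_amt, recent$contbr_year, median)
```

A ratio per candidate makes the "more than 3 times" comparison explicit, while the yearly medians summarize what the boxplot shows about the typical donation size.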
The bar chart above shows the amount of donations that the top 5 candidates
received from women and men respectively. Hillary Clinton is the only
candidate who received more donations from women than from men, even though
men donated more than women overall.
More deep-pocketed donors supported Hillary Clinton as the election drew to
a close.
The majority of not-employed people supported the Democratic party,
and I speculated that they in fact supported Hillary Clinton since most of
the donations to the Democratic party were for her.
In California, people of every occupation tend to support the
Democratic party, especially Hillary Clinton.
For a certain period of time, Bernard Sanders received more donations and
gained more popularity than Hillary Clinton.
The graph above shows the distribution of contribution amounts, which gives a
general picture of the whole dataset, since the contribution amount is
one of the main features of interest.
The graph shows that most one-time contributions (that is,
most people’s donations) are no more than 200 dollars.
The pie chart above illustrates the percentage of total donations that
each party received. It is not surprising that in California, the Democratic
party took the majority of the donations, accounting for 74.86% of the total,
while the Republican party raised only 24.31% of the total.
The bar chart above illustrates the amount of donations to Hillary Clinton,
Bernard Sanders and Donald Trump by the top 10 occupations. Retired people
contributed the most to the presidential election; both Hillary Clinton and
Donald Trump received the majority of their donations from them. Apart from
retired people, attorneys and homemakers also donated a large amount of money
to Hillary Clinton; however, the money that Donald Trump raised mostly
comes from CEOs and presidents. It is worth mentioning that
not-employed people donated the second largest amount among all the
occupations, most of it to Bernard Sanders.
During this project, my challenges and struggles were as follows:
I removed some contributions that exceed $2700 or are negative because of
the “Contribution Limits” on the website.
The original dataset did not provide donors’ gender information;
however, intuitively, the gender of donors and candidates can influence the
contribution to some degree. Therefore, I applied R’s gender package to
predict the gender of donors from their first names.
Sometimes, to make the graphs look clear and attractive, I needed to look up
many extra functions that our course did not cover.
The ggplot2 and dplyr packages are the most important packages for this
project. I also learned the gender and lubridate packages, which
I think are useful.
I am new to R, so through this project I learned a lot of new things,
including drawing statistical graphs and charts in R, and I also gained
experience in exploring a dataset.
By exploring California financial donation data, I found several interesting
characteristics:
There is no doubt that California is one of the bluest states.
Only a few candidates raised most of the donations.
Women tend to donate more to candidates from the Democratic party,
especially to the female candidate.
Retired people are the largest contribution group, followed by people
who were not employed and attorneys.
Los Angeles, San Francisco and San Diego are the top 3 cities that
contributed the most money.
Bernard Sanders received the majority of not-employed people’s donations.
Hillary Clinton received the largest share of the total donations in California.
My analysis is based on a single state, California; however, exploring the data
by candidate would be interesting and give more insights. I saw the datasets
organized by candidate, with which I could investigate the distribution of
people’s support across different states. For example, an analysis of swing
states such as Ohio and Florida could help explore why Donald Trump
beat Hillary Clinton in the end.
Though the election is over, Americans have seen a post-election surge (https://www.theatlantic.com/business/archive/2016/11/donald-trump-donations/507668/)
in donations. There will be more useful financial contribution data to analyze.